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Abstract 

In  this  paper  we  consider  the  direct  solution  of  dense  and  tri- 
diagonal  linear  systems  of  n  equations  on  parallel  computers  using  the 
LDU  factorization  and  the  square  root-free  Givens  triangularization. 

We  show  that  a  dense  system  may  be  solved  using  LDU  factorization 

2 
without  pivoting,  or  Givens  reduction  in  T  =  0(n)  steps  with  p  =  0(n  ) 

processors.   The  resulting  speedup  over  the  sequential  algorithm  is  of 

2 
0(n  )  and  an  efficiency  independent  of  n.   If  pivoting  is  required  for  a 

stable  LDU  factorization,  the  system  can  be  solved  in  T  =  0(n  log^n)  steps 

2  2 

with  p  =  0(n  )  processors.   In  this  case  we  obtain  a  speedup  of  0(n  /log_n) 

and  an  efficiency  of  0(l/log_n). 

Using  Givens  reduction  or  LU  factorization  without  pivoting,  a 
tridiagonal  system  may  be  solved  in  T  =  0(log»n)  steps  with  p  =  0(n) 

resulting  in  a  speedup  of  0(n/log_n)  and  an  efficiency  of  0(l/log?n). 
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1.   Introduction 

Solving  a  system  of  linear  equations  is  one  of  the  most 
frequently  encountered  problems  in  numerical  computing.   Several  tradi- 
tional algorithms  are  available  for  solving  such  systems.   Using  a  serial 
computer,  a  system  of  order  n  can  be  solved  by  Gaussian  elimination  or 

LU  factorization  using  —  n  +  0(n  )  multiplications,  while  Givens  tri- 

h     3      2  12 

angularization  requires  —  n  +  0(n  )  multiplications  and  —  n  square 

roots. 

In  this  paper  we  consider  parallel  versions  of  the  LU  decompo- 
sition and  Givens  reduction  algorithms  for  both  dense  and  tridiagonal 

systems  of  equations.   For  each  algorithm  discussed  we  give  bounds  on 
the  time  and  number  of  processors  required  to  achieve  that  time. 

The  folio-wing  definitions  and  assumptions  hold  throughout 

this  paper.   If  p  processors  are  used,  p  >  1,  we  denote  the  computation 

time  by  T  .   Thus,  the  speedup  of  a  parallel  algorithm  which  uses  p 

processors  over  a  serial  algorithm  is  given  by  S  =  T,/T  >  1.  We 

pl'p  — 

define  the  efficiency  of  such  a  computation  as  E  =  S  /p  <  1.   This  can 
be  regarded  as  the  ratio  of  the  achieved  speedup  to  the  maximum  possi- 
ble speedup  using  p  processors.  We  also  assume  that 

(i)   any  number  of  processors  may  be  used  at  any  time, 
but  we  will  give  bounds  on  this  number, 
(ii)   each  processor  can  perform  any  of  the  four  arith- 
metic operations  in  one  time  step,  and 
(iii)   there  are  no  memory  or  data  alignment  time  penalties. 


In  Section  2  we  discuss  the  LDU  decomposition  algorithm.   We 

show  that  linear  diagonally  dominant  and  irreducible  systems  of  order 

2 
n  may  be  solved  in  T  =  0(n)  steps  using  p  =  0(n  )  processors.  While 

for  an  arbitrary  real  nonsingular  matrix,  p  =  0(n  )  and  T  =  0(n  log  n) 

steps.   The  extra  time  is  required  for  pivoting.   In  Theorem  3,   we  give 

a  new  result  for  solving  tridiagonal  systems.  We  show  that  a  system  of 

order  n  may  be  solved  in  T  =7  log  n  steps  using  no  more  than  p  =  Un 

processors  if  pivoting  is  not  required.   This  is  an  improvement  on  the 
results  in  [1], 

Section  3  contains  our  main  results.   We  adapt  the  Givens 

process  for  parallel  computation  so  that  any  nonsingular  linear  system 

2 
of  order  n  can  be  solved  in  T  =  0(n)  steps  using  p  =  0(n  )  processors. 

3  2 

Since  T     =  0 (n  )  we  achieve  a  speedup  S  -0(h)  with  an  efficiency  which 

is  independent  of  n.   This  is  a  substantial  improvement  over  the  LDU  de- 
composition with  pivoting.   Finally,  we  show  that  any  nonsingular  tri- 
diagonal system  of  order  n  can  be  solved  in  T  =10  log  n  +  0(1)  steps 

with  p  =  4n.   This  algorithm  is  more  stable  than  the  LU  decomposition  of 
Section  2  without  pivoting,  in  cases  where  pivoting  is  necessary  for  a 
stable  LU  factorization,  and  is  also  faster  than  the  result  of  [1]. 
Throughout  the  paper  we  denote  log~x  by  log  x. 


2.   LDU  Decomposition 

In  this  section  -we  consider  the  parallel  solution  of  dense 
and  tridiagonal  systems  of  equations.   For  dense  systems  we  discuss 
both  cases  when  the  matrix  of  coefficients  is  (i)  irreducible  and  dia- 
gonally dominant,  in  which  pivoting  is  not  required  [2],  and  (ii)  an 
arbitrary  real  nonsingular  matrix  where  pivoting  is  usually  required  for  a 
stable  decomposition.   An  algorithm  for  the  LDU  factorization  in  each  of 
the  two  cases  is  presented  in  Theorems  1  and  2,  respectively.   Lemma  1 
can  be  used  in  either  case  to  obtain  the  solution  by  forward  and  back- 
ward substitution.   It  should  be  noted  that  the  time  and  processors 
required  in  Lemma  1,  (T  =  0(n)),  do  not  change  the  order  of  the  results 

of  Theorem  1,  (T   =  0(n)),  or  Theorem  2,  (T   =  0(n  log  n)).   Hence, 

n  n 

the  results  of  the  theorems  can  be  taken  as  time  and  processor  bounds 

for  linear  system  solvers  of  this  type.   We  also  note  that  if  a  linear 
system,  once  factored,  is  to  be  solved  for  a  number  of  different  right- 
hand  sides  (not  all  at  the  same  time),  then  the  time  and  processors  of 
Lemma  1  are  all  that  we  need  for  each  right-hand  side. 

In  Theorem  3  we  present  an  algorithm  for  solving  tridiagonal 
systems.  Since  the  decomposition  and  the  forward  and  backsubstitution 
all  require  the  same  order  of  processors  and  time,  we  present  these  as 
one  theorem,  namely  T.   =  7  log  n. 

We  first  consider  the  case  of  dense  irreducible  and  diagonally 
dominant  systems. 


Theorem  1. 

Let  A  be  an  irreducible  diagonally  dominant  matrix  of  order  n. 
Then  it  can  be  factored  in  the  form  A  =  LDU  in  T  =  0(n)  steps  using 
p  =  0(n  )  processors,  where  L  and  U  are  unit  lower  and  upper  triangular 
matrices  respectively,  and  D  is  a  diagonal  matrix  with  nonzero  entries. 
Thus,  we  obtain  a  speedup  S  =  0(n  )  with  an  efficiency, 

ir 


2   „    2 


Proof: 


and 


Hence. 


Let 


L  =  [ix,  i2,  ...,  iQ],  U  =  [ux,  Ug,  ...,  uQ], 


where 


i\==  (o,  ...,  0,  i,  i^   k,  ...,  |  k), 


u\  =  (0,  ...,  0,  1,  uk^+1,  ...,  u^) 


n       t 
A  =  An  =  £  d.i.u  . 
1   .-ill 

i=l 


in  ■which  d.  is  the  i-th  element  of  D.  At  the  beginning  of  the  k-th 
stage  we  have  the  elements  (d  ,  d  ,    ...,    cL  ),    the  column  vectors 

[i.j_,  i2,  -..,  ^_{\f   !>-,_,  u2>  •••>   \_i^ 

and  the  matrix 

n"1       t 
A=  A  -  E  d.  I.   u  . 

K      1     .-11     1 
1=1 

whose  first  (k-l)  rows  and  columns  are  zero.   Thus 


\  ■  £!>   V  -    *{£/\> 


and 


u  ,   =  a)  '  /±     ,    r  =  k+1,  k+2,  ....  n, 

kr    kr/  k  '  '  '  '      ' 


are  obtained  by  one  division  using  2(n-k)  processors.   We  then  obtain 

t  2 

the  matrix  (ii)u   in  one  multiplication  using  (n-k)  processors, 

and  A,  ,  =  A,  -d,  H ,vl       in  one  subtraction  using  the  same  number  of 

processors.   Therefore,  three  time  steps  are  required  at  the  k-th 

2 
stage  using  max{2(n-k),  (n-k)  }  processors.  Hence  the  whole  factor- 

2 
ization  requires  T  =  3(n-l)  steps  with  p  =  (n-l)  processors  for  n  >  3- 

It  can  be  shown  that  the  sequential  algorithm  requires 

T  =  2n(n  -l)/3  steps.   Thus,  we  obtain  a  speedup,  S  =  2n(n+l)/9 


2        2 
with  a  corresponding  efficiency,  -r  <   E  £  —  ,  where  the  upper  limit  is 

achieved  for  n  =  2. 

If  A  is  symmetric  positive  definite  pivoting  is  again  not 
necessary  [3],  and  it  can  be  factored  in  3(n-l)  steps,  as  in  Theorem  1, 

using  no  more  than  n(n-l)/2  processors.   So  again  we  obtain  a  speedup  of 

2 
0(n  )  with  an  efficiency  greater  than  2/9. 

For  a  general  nonsingular  matrix  a  pivoting  strategy  is,  in 

general,  necessary  for  the  numerical  stability  of  the  decomposition. 

Adopting  a  partial  pivoting  strategy,  we  show  that  the  decomposition 

can  be  performed  in  0(n  log  n)  steps. 

Theorem  2. 

Let  A  be  a  nonsingular  matrix  of  order  n  =  2  ,  for  a 
positive  integer  m.   Then  it  can  be  factored  in  the  form  PA  =  LDU  in 
T  =  0(n  log  n)  steps  using  p  =  0(n  )  processors,  where  L  and  U  are 
unit  lower  and  upper  triangular  matrices,  respectively,  D  is  a  diagonal 
matrix  with  nonzero  entries,  and  P  is  a  permutation  matrix.   Thus  we 
obtain  a  speedup  S  =  0(n  /log  n)  and  an  efficiency  E  =  O(l/log  n) . 


Proof: 


The  proof  is  similar  to  that  of  Theorem  1.   At  the  k-th  stage, 


2 
k  =  1,  2,  ...,  n-1,  we  require  three  steps  as  before  using  (n-k)  pro- 


cessors, in  addition  to 


(n-k+1)] 


log  (n-k+1)   steps  for  searching  a  column 


vector  of  (n-k+1)  elements  for  the  pivot  using 


( n-k+1  )/2 


processors, 


Hence,  the  factorization  can  be  obtained  in  T  <  3(n-l)  +  n  log  n  steps 

2 
using  p  =  (n-1)   processors.   Ignoring  the  time  required  for  pivoting, 

2    2 
the  sequential  algorithm  requires  T-  =  —  n(n  -1)  steps.   Thus  for  large  n, 

2 
we  obtain  a  speedup,  S  -   2n  /3  log  n  with  an  efficiency,  E   -  2/3  log  n. 

Next  we  give  a  lemma  for  the  forward  and  backsubstitution  which 
are  used  to  obtain  a  solution  once  the  matrix  of  coefficients  is  factored. 


Lemma  1 

Given  the  LDU  factorization  of  PA,  then  the  linear  system  of 
equations  Ax  =  f  can  be  solved  in  T  =  0(n)  steps  using  p  =  0(n)  proces- 
sors with  a  speedup  S   =  0(n)  and  an  efficiency  E  =  0(1). 


Proof: 

The  solution  vector  x  can  be  obtained  by  sequentially  solving 
the  three  systems 

Lz  =  Pf  ,   Dy  =  z  ,   Ux  =  y  . 

The  lower  triangular  system  can  be  written  as 


g  z.i.  =Pf  =  f(l), 


i=l  X  X 

(k) 
hence   z   is  given  by  z     =   f £        (k  =   1,2,    ...,    n),   "where 

f(j+l)   =   fU)    -   zj.        j=l,2,...,    n-1 

J    J 

Therefore  each  of  the  triangular  systems  may  be  solved  in  2(n-l)  steps 
using  (n-1)  processors.   Since  Dy=z  can  be  solved  in  one  step  using  n 
processors,  the  total  time  for  solving  Ax=f,  using  p=n  processors,  is 
T  =  4n-3.   The  sequential  algorithm  requires  T..  =  n(2n-l)  steps, 
hence  we  obtain  a  speedup  S  ~  n/2  a^d  an  efficiency  E  ~  1/2. 

We  have  shown  [4,5]  that  a  triangular  system  of  equations 

2  1/3    2 

may  be  solved,  using  0 (n  )  processors,  in  o(n    log  n)  steps.   Since  we 

are  dealing  here  with  dense  matrices  of  moderate  sizes  (n  <_  1000,  say) 

2 
the  process  remains  essentially  of  0(n)  even  though  we  are  using  0(n  ) 

processors.   Hence,  for  higher  efficiency,  we  use  the  above  algorithm 

which  requires  only  n  processors. 


For  tridiagonal  systems  of  equations  it  has  been  shown  [1], 
that  the  direct  LU  factorization  of  the  matrix  of  coefficients  and  the 
solution  of  the  resulting  system  can  be  obtained  in  17  [log  nl  steps 
using  n  processors.   We  next  give  a  smaller  upper  bound  on  T  for  solving 
such  systems.   As  a  further  comparison  with  the  method  of  [1],  we  note 
that  using  Theorem  3  and  n  processors  a  tridiagonal  system  can  be  solved 
without  pivoting  in  just  16  log  n  steps. 


Theorem  3 


Let  A  he  a  nonsingular  tridiagonal  matrix  of  order  n=2  ,  for 


a  positive  integer  m.   Then  the  system  of  equations  Ax=f  may  be  solved 
using  the  LU  factorization,  A=LU  where  L  is  unit  lower  triangular  and 
U  is  upper  triangular,  in  T  =7  log  n  steps  using  p=Un  processors. 

IT 

This  results  in  a  speedup  S  =  O(n/log  n)  with  an  efficiency 

E  =  0(1/ log  n). 

p      ' 

Proof: 

It  has  been  shown  in  [A  ]  that  any  m-th  order  linear  recurrence 
system  of  n  equations,  or  any  unit  triangular  system  of  order  n  and 
band-width  (m+1)  may  be  solved  in, 

1     2 
T  =  (2  +  log  m)  log  n  -  —  (log  m  +  log  m) 

2 
steps  with  p  =  0(m  n)  processors.   In  [1]  it  was  shown  that  the  LU  factor- 
ization of  A  is  essentially  equivalent  to  solving  a  second  order,  m=2,  linear 


10 


recurrence  system  of  n  equations.   Thus  the  factorization  can  be 

obtained  in  T   =  (3  log  n  -  1)  steps  with  pn  =  kn.      Each  of  the  for- 
P-i  1 

ward  and  backward  sweeps  requires  the  solution  of  a  first  order  linear 

recurrence  system  of  n  equations.   Hence  the  time  required  for  these 

sweeps  is  T   =1+4  log  n  with  p  ■  n  processors.   Therefore,  the 
P2 

total  time  for  solving  the  tridiagonal  system  of  equations  is 

T   +  T   =  T  =  7  log  n  steps  with  p  =  An  processors.   This  results 
Px    P2    P 

in  a  speedup  S  =  8(n-l)/7  log  n  with  an  efficiency  E  -   2/7  log  n. 

Note  that  if  pivoting  is  necessary  for  a  stable  LU  factoriza- 
tion, then  the  scheme  in  [1]  is  not  satisfactory.   In  the  following 
section  we  introduce  another  algorithm  that  is  more  stable. 


11 


3.   Givens  Triangularization 

In  this  section  we  consider  the  parallel  solution  of  a 

general  linear  system  using  orthogonal  factorization  of  the  matrix  of 

2 
coefficients.   Theorem  k   shows  that  with  p  =  0(n  )  processors  we  can 

triangularize  any  real  nonsingular  matrix  in  T   =  0(n)  steps  using  a 

n  2 

modified  Givens  process.   Thus  we  achieve  a  speedup  S  p  =  0(n  )  and  an 

n 

efficiency  independent  of  n.   Lemma  2  shows  that;,  after  the  orthogonal 

factorization  of  the  matrix  of  coefficients,  we  can  solve  the  linear 

system  with  n  processors  in  0(n)  steps.   Thus  Theorem  k   determines  the 

order  of  time  and  processors  for  linear  system  solving  using  Givens 

reduction.   Again  we  note  that  if  a  linear  system  is  to  be  solved  for 

various  right-hand  sides ,  we  use  the  time  and  processors  of  Theorem  k 

once,  and  of  Lemma  2  for  each  right-hand  side. 

Finally  Theorem  5. shows  that  any  tridiagonal  system  of 
equations  may  be  solved  using  Givens  reduction  in  T  =10  log  n  steps 
with  p=l+n  processors. 

Now  we  consider  the  orthogonal  factorization  of  a  real  non- 
singular  matrix  A, 

Q  A  =  R  (1) 

where  Q,  is  an  orthogonal  matrix  formed  as  the  product  of  plane  rotations, 
and  R  is  upper  triangular.   This  standard  Givens  process  requires 
roughly  ^n  /3  multiplications  and  n  /2  square  roots,  where  n  is  the 
order  of  the  matrix  A. 


12 


Recently  [6,7]  it  has  been  shown  that  the  matrices  D  and  R  in 
the  modified  Givens  reduction, 


i 

Q  A  =  D2R  (2) 


can  be  obtained  using  \   the  multiplications  of  the  standard  Givens 
process  and  no  square  roots  if  Q  is  not  computed.   We  will  use  this 
modified  process  in  our  parallel  algorithm  to  avoid  square  root  evalua- 
tion after  every  transformation.   First. we  describe  a  numerically  stable 


A  Givens  transformation/      \  UvU,  ,■< 'U.  i  ■•■■•:  Up:   ■.■<■■  i  ■ 


version  of  this  square  root-free  Givens  reduction 

') 

Vk  u    s/k  u   ...  \'k  u  ... 

■J~£  u,       J~£ u       ...     \Ttt  u     . . . 
so  as  to  annihilate  the  first   element  of  the   second  row  would  produce, 


n/p  u±      \Tp  u2     ...     sTp  ur 


0  sTq  v        . . .     \Tq  v 


and  the  rotation  is  determined  by, 


c2  =  l/(l+op)  ,    s2  =  cp/(l+dp)  (3) 


where, 


13 


a 


=  iv^k^  ,         B  =  v1/u1  (U) 


Thus, 


i  &\  s   /£" 

u     =        u     +       V 

r  r-     J       r  /-  r 

•pi  1    ^p 


I     /k    1  ,         /  C     /iT    . 

v     =   -  u     +       V 


Simplifying   (5),   we  obtain 


u  =  u  +  a   v  , 
r    r      r 


v   =  -Bu  +  V 
r    K  r    r 


(5) 


There  are  several  choices  for  p  and  q;  one  such  choice  is 


p  =  c  k  ,   and   q  =  c  £.  (6) 


and  ( 5  ' ) 


Furthermore, 

p  =  k/(l+a  6) 
and  (6') 

q  =  i/(l+a  6)  . 

Unfortunately,  however,  the  method  suffers  from  the  possibility  of 
underflow  in  p  and  q.   This  may  be  avoided  by  interchanging  the  two 
rows  under  consideration. 
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On  a  serial  computer,  evaluating  u  ,  v   (r  =  1,  2 t) ,  p 

and  q  require  (4t+5)  steps,  while  on  a  parallel  computer  this  process 

requires  four  steps  with  (2t+5)  processors  as  shown  in  Figure  1.   In 

fact,  we  can  apply  more  than  one  rotation  at  a  time  in  triangularizing 

A.   Similar  to  the  parallel  Jacobi  method  [8],  several  schemes  are 

possible.   We  present  an  annihilation  scheme  that  requires  0(n)  steps 

to  produce  D,  R  and  Q  in  (2) .   Denoting  each  element  annihilated  in  the 

k-th  transformation  by  the  integer  k,  the  annihilation  regime  is  shown 

in  Figure  2.   Let  us  denote  by  (i,  i,  j)  that  rotation  which  rotates 

rows  i  and  i,  annihilating  the  element  in  position  (i,  j).   The  k-th 

transformation  consists  of  the  rotations  (i,  i+1,  j),  i=i+l,  where  i 

and  j  are  given  by  the  sequences : 

(i)   for  k  =  1,  2,  . . .,  n-1 


i  =  [(n-k),  (n-k+2),  ...,  (n-l-S(k))} 
J  =  (1,2,  ...,  |~k/2~|} 


(7) 


(ii)   for  k  =  n,n+l,  ...,   2n-3 


i  =  {(k-n+2),  (k-rw-10,  •••,  (n-l-8(k))) 
j  =  {(k-n+2),  (k-n+3),  ...,   I  k/2~]  } 


(8) 


where  s(k)  =  0  for  k  odd,  and  1  for  k  even. 
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■\ 


r  1  1 


\r=l,2,...,t 


J 


q.     =        £  k 


Uj_       u1         (  k  ux         ux     +  t  v1       v1) 


-1 


p  =  k  k 


u 


1  ux 


(k  u^  u-j_  +   i     V]_  Vq_   ) 


-1 


Figure  1. 
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Since  it  is  only  an  odd  numbered  transformation  that  introduces 

a  zero  in  the  last  row,  the  total  number  of  required  transformations  is 

1 
(2n-3).  Hence  in  computing  D2  and  R  in  (2)  we  require  T   =  U(2n-3) 

Pl 
steps  and  one  square  root  with  the  number  of  processors  not  exceeding 


Pi  ■ 


n 
Z 


(21*3)  =  In/2 
1    t=n-|  n/2  +1  l.      . 


(2(n+2) 


n/2 


}• 


For  n=2  ,  where  m  is  a  positive  integer,  p  =  4  n  +  2n. 

The  orthogonal  matrix  Q,  is  given  by  Q,  =  Q^    . . .  (^  £L  ,  where 

each  transformation  0  is  the  direct  product  of  the  rotations  (i,i+l,j) 

given  by  the  sequences  (7)  and  (8).   That  is,  &  is  a  block-diagonal 

matrix  consisting  of  2  X  2  diagonal  submatrices.   Thus,  while  we  do  not 

form  Q  explicitly,  Q  consists  of  a  total  of  n(n-l)/2  rotations  all  of 

which  can  be  evaluated  from  (3)  in  T   =3  steps  plus  one  square  root, 

P2 
using  p  =  n(n-l)  processors.   Therefore,  the  total  parallel  time  for 

obtaining  D2,R,  and  Q,  is  given  by  T  =  T   +  T   =  (8n-9)  steps  and  two 


1 


P, 


square  roots  with  p  =  n(n-l)  processors. 

On  a  serial  machine  the  time  required  for  obtaining  D2   and 

R  in  (2),  i.e.,  evaluating  (5')  and  (6')>  is  given  by, 

n 
Tl  =  t^2  (t_l)  (UU5) 


n(n-l)  (8n+23)/6 
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steps  and  n  square  roots.   While  the  time  required  for  evaluating  the 
n(n-l)/2  rotations  in  Q  is  given  by  T"  =  7n(n-l)/2  steps  and  n(n-l) 

square  roots,  as  can  be  shown  from  (3)  and  (4).   Thus  the  total  sequential 

2 
time  is  T  =  T'  +  T''  =  2n(n-l)  (2n+ll)  /3  steps  and  n  square  roots,  and  we 

?  18 

achieve  a  speedup  Sp  =  n  /6  with  an  efficiency  •?   <  E  <_  rr  .   Hence  we 

proved  the  following  theorem. 

Theorem  k 

Let  A  be  a  nonsingular  matrix  of  order  n,  then  it  can  be 
factored  in  the  form, 

QA  =  D2  R 

1 

where  D2 ,  R,  and  Q. (i=l,2,  . ..,  2n-3)  may  be  evaluated  in  T  =  0(n) 

2 
steps  and  two  square  roots  using  p  =  0(n  )  processors.   This  results 

2  18 

in  a  speedup  Sp  =  0(n  )  with  an  efficiency  -r  <   E  <^  -rr-  . 

We  have  also  developed  an  annihilation  scheme  that  requires 
2(n-l)  -  log  n  transformations  instead  of  (2n-3) ,  however  the  relation 
between  i,  i,  and  j  are  more  complicated  than  that  in  (7)  and  (8). 
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Lemma  2 


1 

Given  the  orthogonal  factorization  Q,  A  =  D2  R,  i.e.,  D,  R, 


and  Q.   The  linear  system  of  equations  Ax  =  f  may  be  solved  in 

T  =  0(n)  steps  and  one  square  root  using  no  more  than  p  =  n  processors. 
Hence  we  obtain  a  speedup  Sp  =  0(n)  with  an  efficiency  —  <  E     <_  y  . 

Proof: 

From  (2)  we  have 

D2  Rx  =  Qf  (9) 


Since  the  matrices  Q.  in  Q  =  Q^    ...  Q,  Q  are  block  diagonal,  the 
right  hand  side  f  =  Qf  may  be  computed  sequentially, 


f,,-,  =  Q,  f,     0  =  1,2,  ...,  2n-3  (10) 

where  f  =  f  and  f     =  f.   Using  n  processors,  we  require  three  steps 
for  each  f.  ,.   Thus  f  may  be  computed  in  3(2n-3)  steps  using  n  pro- 
cessors.   Solving  for  z  in  D2z  =  f  requires  one  division  with  n  pro- 
cessors.  Finally  the  solution  vector  x  in  Rx  =  z  can  be  obtained  in 
3(n-l)  steps  using  (n-1)  processors,  (note  that  R  is  not  unit  upper 
triangular).   Thus  the  total  time  for  solving  (9)  is  given  by, 
T  =  (9n-ll)  steps  with  p  =  n  processors.   The  sequential  algorithm 

requires,  T-  =  2n(2n-l)  steps.   Thus  we  obtain  a  speedup  Sp  -   4n/9 


4       6 
with  an  efficiency  —  <  E  <  ■=■ 

9    p  —  7 


Let, 


Next  we  consider  Givens  reduction  of  tridiagonal  matrices 


20 


A  = 


ai       bi 


e     a  ,   b   , 
n-1   n-1   n-1 


n     n 


,m 


be  of  order  n=2  ,  where  m  is  a  positive  integer.   We  annihilate  the 
subdiagonal  elements  of  A  by  applying  the  orthogonal  transformation 


Q  =  Q 


n-1 


Q  Q  ,  where  each  Q .  is  a  Givens  transformation  rotating 


rows  j  and  j+1  such  that  the  element  in  the  position  (j+1,  j)  is 
eliminated, 
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s  . 


row  j 


V 


-s . 
J 


c. 

J 


Thus,  after  step  (r-1) ,  Q  ,  Q  _2  ...  Q..  A  is  given  by, 


r-1 


'r+1 


r+1 


c   n  b 
r-1  r 


r+1      r+1 


and  by  induction  it  follows  that, 
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c  =  p  /a   ,    s  =  e  , ./a  (11) 

r    r  r       r    r+1  r 


p  =  c   na  -c   0s   -b   - 
r    r-1  r    r-2  r-1  r-1 


I     =   s  a  .-  +  c  ..c  b      r  -  1,  2,  ...,  n-1   (12) 
r    r  r+1    r-1  r  r 


Y  =  s  b   . 
r    r  r+1 


where  c  =  1,  s  =0.   From  (11)  and  the  first  equation  in  (12)  we 
o       o 


obtain  the  recurrence  relation, 


a  c  =  c  ..  a  -  c  0(e  b  _)a  ,  (13) 

r  r    r-1  r    r-2  r  r-1  r-1 


Let  x,  =  c.   II  a.  ;  then  (13)  reduces  to, 
k    k  .  n   i 
i=l 


To  =  X    '        Tl  =  al 


x  =  a  t  .  -  (e  b  ..)t  »    r  =  2,  3,  ...,  n-1        (14) 
r    r  r-1     r  r-1  r-2 


This  can  be  expressed  as  the  system  of  linear  equations, 
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-a. 


62bl 


63b2 


-a. 


e   ,13 
n-1  n 


■a  .    1 
n-1 


l2 
T3 


n-1 


The  matrix  of  coefficients  can  be  formed  in  one  step  using  (n-2) 
processors.   The  system  itself  can  be  solved  [A],  in  (3  log  n-1)  steps 

using  no  more  than  An  processors.   Thus  obtaining  the  vector  t  requires, 

2    2 
T   =3  log  n  steps,  with  p  =  An  processors.   Since  c,  +  s,  =  1,  then 

P  -  X  IC      K 

2    k   2 
defining  9,  =  IT  a.  ,  we  obtain  the  recurrence  relations, 

k   i=l  ± 


«2    2 
6  =  t  =  1 
o    o 


s2  2    2      2 

\  =   ek+l  6k-l  +  Tk 


k  =  1 ,  2 ,  . . . ,  n-1 


(15) 


where,  we  have  used  (11).   The  corresponding  system  of  equations  is, 
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-e, 


-e. 


-e 


n 


r 

1 

i 

e2 

O 

2 

2 

h 

2 

Tl 

2 
92 

• 

2 
T2 

• 

• 

• 

• 

• 

• 

4l 

2 

Tn-1 

- 

The  matrix  of  coefficients  and  the  right-hand  side  can  be  formed  in 

one  step  using  2(n-l)  processors,  and  the  bidiagonal  system  is  solved 

2 
in  2  log  n  steps  with  (n-1)  processors.   Thus  the  elements  9.  , 

i  =  1,  2,  ...,  n-1,  can  be  obtained  in  T   =1+2  log  n  steps  with 


p9  =  2(n-l)  processors.  Now  we  obtain  c  ,  a.,  and  s.  from, 
z  J   J       J 


2    2,2     2    2,2       2    2.2 
c.  =  ■c./e*  »  a4  «=  e./e,  .  ,  s.  =  e   /a. 


(16) 
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in  two  steps  and  one  square  root  using  3(n-l)  processors.   Once  these 

are  obtained,  the  elements  ft,,  y.»  and  a  ■  p  in  (12)  of  the  upper 

j   j       n   rn 

triangular  matrix  are  obtained  in  three  steps  using  2n  processors.   Thus 

the  total  time  required  for  obtaining  the  upper  triangular  matrix  and 

the  rotations  (c  ,  s.,  j  =  1,  2,  ...,  n-1)  is  T  =6+5  log  n  steps  and 

one  square  root  using  no  more  than  4n  processors. 

It  can  be  shown  [9]  that  the  orthogonal  matrix  Q  is  of  the  form, 


Q(u,v,w)  = 


U1V1 


Vi 


u  ..v.. 
n-1  1 


U  V        w 
2  2       2 


n-1  2 


u  vn       u  V- 
n  1       n  2 


u  nv  . 
n-1  n-1 


u   v  , 
n   n-1 


w 


n-1 


u  v   i 
n  n  i 


where  the  vectors  u,  v,  and  w  are  given  by, 


u.  =  c.  n-  -i   3        v.  -  v,.  n/  M. 

J    J   0-1      J    0-1'  0 


w.=s.     j=l,2,  ...,  n-1 
J    J 


=  c.i-l/ni-l     j  =  1,  2,  ...,  n 


(17) 


■where,  c  =  c  =  1  and, 
'      o    n 
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\  =   1 


i   j 
n.  =  (-DJ  n  s      j  =  1,  2,  ...,  n-l         (18) 


k-1  k 


The  matrix  Q  is  not  formed  explicitly,  but  is  represented  by  the  vectors 
u,  v,  and  w.   The  vector  n  is  obtained  in  log  n  steps  using  (n-2)/2 
processors,  hence  u  and  v  are  obtained  in  one  step  using  (2n-3)  pro- 
cessors. We  are  now  in  a  position  to  introduce  the  following  theorem. 


Theorem  5 


Let  A  be  a  nonsingular  tridiagonal  matrix  of  order  n=2  ,  for 


a  positive  integer  m.   Then  the  system  of  equations  Ax=f  may  be  solved 
using  the  orthogonal  factorization  QA=R,  where  R  is  upper  triangular  of 
band  width  3,  in  T  =10  log  n  +  0(l)  steps  with  p  =  kn   processors. 
The  resulting  speedup  and  efficiency  are  of  the  same  order  as  in 
Theorem  3« 

Proof: 

The  determination  of  R  and  Q  required  (7+6  log  n)  steps 
and  one  square  root  using  no  more  than  Hn  processors.   Now  the  system 
to  be  solved  is  Rx  =  Q,f.   f  =  Qf  may  be  expressed  as, 

f  =  (ULV  +  W)f 
where, 
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and 


U  =  diag(u.)  ,   V  =  diag(v.) 


W  =  diag(\*1,  w2,  . ..,  wn_1,  0), 


'  1   i>j 
I.  .   = 
1J    '  0   i<j 


Thus  f  can  be  evaluated  in  (U  +  log  n)  steps  using  n  processors. 
Rx=f  can  be  solved,  [4],  in  3  log  n  steps  using  kn   processors.   There- 
fore, the  total  time  for  solving  Ax=f  is  T  =  (.11  +  10  log  n)  steps 
and  one  square  root  with  p  =  Un  processors.   The  sequential  algorithm 
requires  3(8n-9)  steps  and  (n-l)  square  roots.  Hence  we  obtain  a 

speedup  S  ~  12  n/5  log  n  -with  a  corresponding  efficiency,  E  -   3/5  log  n, 
p  -     /  P 
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